Distance Dimension Reduction on QR Factorization for Efficient Clustering Semantic XML Document Using the QR Fuzzy C-Mean (QR-FCM)
Abstract
The rapid growth of XML adoption has created a need for a proper representation of semi-structured documents, one that takes each document's semantic structural information into account to support more precise document analysis. To analyze the information represented in XML documents efficiently, research on XML document clustering is actively in progress. The key issue is how to devise the similarity measure between XML documents that is used for clustering. Because XML documents have a hierarchical structure, a general-purpose document similarity measure is not appropriate for clustering them. Dimension reduction plays an important role in handling massive quantities of high-dimensional data, such as large collections of semantically structured documents. In this paper, we introduce distance dimension reduction (DDR) based on the QR factorization (DDR/QR) or the Cholesky factorization (DDR/C). DDR generates lower-dimensional representations of high-dimensional XML documents that exactly preserve the Euclidean distances and cosine similarities between any pair of XML documents in the original space. After projecting the XML documents onto the lower-dimensional space obtained from DDR, our proposed method runs fuzzy c-means clustering on the reduced representations; we call this method QR-FCM. DDR can substantially reduce the computing time and/or memory requirement of a given document clustering algorithm, especially when the algorithm must be run many times to estimate parameters or to search for a better solution.
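The distance-preserving property claimed in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the documents are stored as the columns of a d x n feature matrix with d >= n and full column rank, it uses random data in place of real XML feature vectors, and the function names (`ddr_qr`, `ddr_cholesky`, and the two helpers) are hypothetical.

```python
import numpy as np

def ddr_qr(X):
    """Reduce a d x n matrix X (documents as columns, d >= n) to n x n.

    With X = Q R and Q having orthonormal columns, the columns of R have
    the same pairwise Euclidean distances and cosine similarities as the
    columns of X.
    """
    # mode="r" returns only the upper-triangular factor R.
    return np.linalg.qr(X, mode="r")

def ddr_cholesky(X):
    """Same idea via the Cholesky factor of the Gram matrix X^T X.

    Assumes X has full column rank so that X^T X is positive definite.
    """
    L = np.linalg.cholesky(X.T @ X)  # X^T X = L L^T, L lower triangular
    return L.T                       # upper-triangular factor

def pairwise_distances(M):
    """Euclidean distances between all pairs of columns of M."""
    diff = M[:, :, None] - M[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

def cosine_similarities(M):
    """Cosine similarities between all pairs of columns of M."""
    unit = M / np.linalg.norm(M, axis=0, keepdims=True)
    return unit.T @ unit

# Hypothetical data: 20 documents, each with 1000 features.
rng = np.random.default_rng(0)
X = rng.random((1000, 20))

R = ddr_qr(X)        # 20 x 20 representation
C = ddr_cholesky(X)  # 20 x 20 representation

print(R.shape, C.shape)
print(np.allclose(pairwise_distances(X), pairwise_distances(R)))
print(np.allclose(cosine_similarities(X), cosine_similarities(C)))
```

Any clustering algorithm that depends only on Euclidean distances or cosine similarities between documents could then be run on the columns of `R` or `C` instead of the columns of `X`. The Cholesky route forms the Gram matrix explicitly, which squares the condition number, so the QR route is usually the numerically safer choice.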
Similar resources
Image Compression Method Based on QR-Wavelet Transformation
In this paper, a procedure is reported that discusses how linear algebra can be used in image compression. The basic idea is that each image can be represented as a matrix. We apply linear algebra (QR factorization and wavelet transformation algorithms) on this matrix and get a reduced matrix out such that the image corresponding to this reduced matrix requires much less storage space than th...
QR Factorization Based Blind Channel Identification and Equalization with Second-Order Statistics
Most eigenstructure-based blind channel identification and equalization algorithms with second-order statistics need SVD or EVD of the correlation matrix of the received signal. In this paper, we address new algorithms based on QR factorization of the received signal directly without calculating the correlation matrix. This renders the QR factorization-based algorithms more robust against ill-c...
Efficient Kernel Discriminant Analysis via QR Decomposition
Linear Discriminant Analysis (LDA) is a well-known method for feature extraction and dimension reduction. It has been used widely in many applications such as face recognition. Recently, a novel LDA algorithm based on QR Decomposition, namely LDA/QR, has been proposed, which is competitive in terms of classification accuracy with other LDA algorithms, but it has much lower costs in time and spa...
MATHEMATICAL ENGINEERING TECHNICAL REPORTS CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System
Designing communication-avoiding algorithms is crucial for high performance computing on a large-scale parallel system. The TSQR algorithm is a communication-avoiding algorithm for computing a tall-skinny QR factorization, and TSQR is known to be much faster and as stable as the classical Householder QR algorithm. The Cholesky QR algorithm is another very simple and fast communication-avoiding a...
On Mixed and Componentwise Condition Numbers for Hyperbolic QR Factorization
We present normwise and componentwise perturbation bounds for the hyperbolic QR factorization by using a new approach. The explicit expressions of mixed and componentwise condition numbers for the hyperbolic QR factorization are derived.
Publication date: 2009